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AESTRACT 



An abundance of Data 3ase Management Systems and 
Query Languages already exist, not to mention these 
which have been, and continue to be proposed. Most 
Data Base Management System surveys focus on the type 
of model used to represent the data, methods of 
access, protection, etc. This paper acquaints the EDP 
manager with the fundamental differences among the 
more significant query languages with emphasis on 
those characteristics which should be considered when 
choosing a guery language. The term query language as 
used here has been expanded to include the entire user 
interface to the data base, and encompasses both data 
sublanguages and stand-alone query languages. 
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1 • BA CKGRO UND 



A. THE DATA BASE - WHAT IS IT? 

A data base is not just a file system. While a data 
base could include a file system, it is much b roa der in 
scope. In general, an automated file system is a continuous 
group of fixed length records, sequentially ordered, which 
are accessed through card readers, tape units, and usually 
slow-speed rotating storage devices. 

The data base came of age with the advent of fast, 
relatively inexpensive random access devices. A data file 
previously tied to and used with a specific application 
program was often unavailable to other users. This meant 
that for each new application a new program would be written 
necessitating a new data file relevant to that application. 
This led to much duplication of data, which, when combined 
with infrequent and inconsistent updating methods, produced 
a predictably large proliferation of redundant, often 
outdated, data files. 

Martin [P.ef. 1] provides the following definition of a 
data base in contrast to traditional file structures: 

"A data base may be defined as a collection of 
interrelated data stored together with as little redundancy 
as possible to serve one or more applications in an optimal 
fashion; the data are stored so that they are independent of 
programs which must use the data; a common and controlled 



approach is used in adding new data and in modifying and 
retrieving existing data within the data base." 

Summarizing, a data base, as compared to a file system, 
reduces data redundancy, proliferation, and inconsistencies, 
permits shared access, and provides improved data integrity 
and comprehensive data protection. 



B. BRIE? HISTORY 



Prior to the age of the computer, data was stored and 
controlled in some form of clerical ledger. Thus manual 
extraction cf information was severely restricted by labor 
costs and the output capacity per clerk. Additionally, such 
systems were subject to a high error rate and were typically 
redundant. 

Of the early attempts at integrating information 
systems, the one most often mentioned is a project developed 
at the Mitre Corporation for the U.S. Air Force Electronics 
System Division. The outgrowth of the project was the 
Advanced Data Management System (ADAM), significant for its 
external data definition facility, which allowed different 
data base applications to use a common retrieval system. 

Early data base systems employed exclusively low-level 
query languages. As Data Base Management System (DBMS) 
technology has developed, there has been a parallel 
development of query languages, not unlike the evolution of 
high-level programming languages during the development of 
modern computer systems. 

Fry and Sibley [Ref. 2] cite three significant families 
of systems developed in the first decade of DBMS technology: 
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the formatted file/GIS family originated at the David Taylor 
Model Basin around 1958; the Bachman/IDS family, an 
Integrated Data Store facility developed at General Electric 
which was noted for its random access storage and high-level 
data manipulation language; and the Postley/MARK IV family 
for the IEM System/360. 

The Data Base Task Group (DBTG) , a CODASYL programming 
language committee formed to extend COBOL to operate in a 
database environment, made reports in 1969 [Ref. 3], 1971 
[Ref. 4], and 1973 [Ref. 5]. These reports generally 
approached the data management question on the basis of 
using two separate languages; the Data Definition 
(Description) Language (DDL) and the Data Manipulation 
Language (DM!) . 
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C. RECENT DEVELOPMENTS 

Considering the dominance of IBM in the data processing 
field, it is hardly surprising that most commercially 
available data base management software systems today run on 
IBM equipment. These include: Information Management 

System/360 (IMS/360) , released by IBM in 1969; ACABAS, 
released by Software AG in 1970; IDMS (Cullinane 



Corporation, 1973); System 2000 (MRI Systems, 1970); and 
TOTAL (Cincom Systems, 1971). 

While the substance of the commercial market today 
remains in the realm of the CODASYL/IMS network/hierarchical 
system approach to D3MS , significant effort in recent years 
has been devoted to relational data bases. A notable effort 
in this area is the Interactive Graphics and Retrieval 
System (INGRES), a PDP-11/h0 based hardware configuration 
installed and running on top of the UNIX Operating System at 
the University of California, Berkeley [Refs. 7, 8 j. 
Another major effort is System R [Refs. 9, 10], a relational 
implementation developed at the IBM San Jose Research Lab. 
System R runs on an I3K/370 usin,g VM/370 and provides a 
complete data base management capability. 

D. MAN- MACHINE 

Almost the entire thrust of recent DBMS proposals 
appears to have been in the direction of relational models 
underlying a non- procedural query language interface. 
Furthermore, each new proposal would seem to be designed for 
a more casual class of user than its predecessor. This is 
in an effort to make the machine perform ever more of the 
thought processes and improve the efficiency of interaction 
with the user. Obviously, as the machine assumes a greater 
role in the interaction, the user’s efficiency increases and 
the costs both in processor time and software escalate. It 
would seem that there must come a point when it will be 
recognized that even the most casual user will always be 
required to possess at least a minimal, rudimentary 
knowledge of the environment in which his queries are to be 
formulated. 



In the next chapter, the terminology associated with 
query languages and data bases is presented. Chapter III 
presents proposed criteria for measuring and selecting a 
query language. Chapter IV will present the characteristics 
of some of the better known, more widely used, or more 
interesting query languages. The final chapter provides 
some further guidance for selecting a query language, 
discusses seme implications of the current trends in DBJ1S 
and examines some of the prospects for future systems. 
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II 



TERMINOLOGY 



A. DATA EASE MODELS 



Much effort has been expended in comparing rhe three 
logical methods of organizing data by DBMS's. The 
advantages and disadvantages of each are well established. 
While the models are not the central issue here, a few words 
will be devoted to discussion of them in order to provide a 
framework from which to explore alternative methods of 
comparison. Martin [Sef. 1] and Date [Ref. 11] provide 
additional material on data base models. 



1 . N etw ork 

In the network approach, typified by systems of the 
CODASYL/DBTG family, record occurrences are represented as 
nodes of a network, chained together by named, directed 
arcs. The arcs present logical links between the entities 
which can be traversed in the specified direction in order 
to navigate through the data base. 

2 • Hierarchy 

A restricted form of the network approach is the 
hierarchical model in which record occurrences are 
represented as nodes of a tree in a strictly owner-member 
(or more traditionally parent-child) relationship. 
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3 . 



Rela ti onal 



In the relational model data is viewed as a group of 
tables or flat files (relations) . Each table is composed of 
rows (tuples) . The order of the columns (attributes) within 
tables is of no significance and no hierarchical or graphic 
relationship exists among the tables containing the data. 

B. DATA INDEPENDENCE 

General users of a data base do not want to be and 
should not have to be concerned with data base 
implementation details such as access methods, character 
representation, or a host of other physical implementation 
and operating system particulars. All they need is a "view" 
of the data that will allow them to formulate queries and 
manipulate data. These users desire an "independence" from 
implementation details. 

These details of access method, character 
representation, floating-point and integer representation, 
pointers, and record blocking are referred to as the 
physical structure of a data base. Freedom from the storage 
and access details gives the user "physical data 
independence". What the user is provided in the place of a 
physical view is a "logical view" of the data. Furthermore, 
it is often advantageous to provide different users with 
individually tailored logical views of the data. To meet 
this need and to give the system added flexibility, the 
following general approach is normally taken. 

A system logical view of the data, termed a schema, is 
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defined. For hierarchical and network based systems the 
schema describes the relationship between record types and 
specifies the contents of record fields (data items) . 
Similarly, it describes the structure of the relations in 
relational systems. Subschemas are then defined which give 
each user, or group of users their own logical view of the 
data. Thus, users are provided with "logical data 
independence" . 

A system that provides true physical data independence 
would allow the physical storage and access details to be 
changed without affecting the logical structure of the data 
(schemas and subschemas) . True logical data independence 
exists only when logical changes can be made to the data 
base without significantly affecting the programs which 
access it [Ref. 2]. With one recent exception, Apple 
[Ref. 12], which is discussed in Chapter IV, logical data 
independence is currently more of a goal of a data base than 
a characteristic. 



C. QUERY LANGUAGE DEFINITION 



A data definition facility must exist to translate the 
schema and subschemas into a form usable by the data base 
system. A data manipulation facility is required to allow 
data in the system to be deleted, changed, and manipulated. 
The data definition facility, or Data Definition Language 
(DDL) , describes the details and content of the schema and 
subschema tc the system. The data definition language may 
be a separate language available only to the Data Base 
Administrator (DBA) , or it may be an extension of an 
application programming language or query language. There 
may, in fact, be two DDL's: one to define the schema and 
another to define subschemas. Alternatively, portions of 
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the data definition facility used for defining subschemas 
may exist as an extension to a query language; for example, 
the DEFINE VIEW statement of "embedded" SEQUEL allows the 
user to create a view against which he may issue queries or 
define other views [Ref. 9]. 

The data manipulation language in addition to having a 
query capability, characteristically provides the facility 
to update, create, and remove data base entities. Other 
operators typically include COUNT, SUM, MAX, MIN, AVG, 
relational operators, boolean operators, and inclusion 
operators. Like the data definition facility, the data 
manipulation facility may be an extension of a host 
application programming language; in such cases it is 
referred to as a data sublanguage (DSL) and is said to be 
"embedded" in rhe host language. The data manipulation 
facility may also exist as a stand-alone query language 
through which the user interacts directly with the DBMS; 
some authors [Ref. 10] limit the use of the term "query 
language" to languages of the stand alone variety. No 
distinction will be made here between a DSL and a query 
language except to point out that the latter is generally, 
though not necessarily, less procedural (this term will be 
defined later) . The term query language will be used to 

refer to both, and is simply defined as the use r int erface 
the data base . 



In evaluating general purpose programming languages, 
consideration is normally given to the following: syntactic 

clarity, data structures, control structures, operators, 
efficiency of program execution, and, more recently, 

efficiency cf program design, efficiency of problem 
solution, and compatability with top-down programming 

techniques [Ref. 13]. To use these same characteristics to 
examine query languages amounts to looking at query 



languages only through the eyes of a programmer. 
Application programmers would probably be quite content with 
the "procedural" query languages selected using these 
criteria; however, everyone who wishes to interface directly 
with a data base is not a programmer. 
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III. USEE^UERY LANGUAGE COMPATIBILITY 



Having recognized the importance of identifying the 
various user groups, this chaprer defines five classes of 
user. The differences among these classes and the changing 
relative importance of them establish the need for measures 
of query languages. Two quantitative and three qualitative 
measures are proposed. 



A. CLASSES OF USERS 



Codd [Ref. 14] divides the users of data bases into 
five classes. 



1 • System Analy sts /DBA ' s Staf f 

The system analysts are responsible for maintaining 
the data base management system, a function which includes 
creating or altering logical views of the data. 

2. Appl ica tion Progra m mme rs 

The application programmer serves as the middle-man 
for most of today's data processing needs; his function 
should be limited to designing and optimizing frequently 
executed routine queries or those queries that are 
inappropriate for more non- procedural query languages (more 
will be said about this later) . 
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3. 



On -l ine Job- trained U se r 



This group includes bank tellers and insurance 
company clerks who use the data base to answer routine 
queries on a random basis. The needs of this group are 
structured in nature, allowin'g most of their queries to be 
formalized. An example might be, "What is the balance of 
James M . Simpson’s checking account?" 



4 • Resear che rs 



This class of users is quite diverse but their 
queries could probably be characterized as being ad hoc and 
requiring aggregate results. Users in this group are most 
likely willing to wait a few hours or even days for an 
answer. 



5. Ca su al User 



Some authors [Ref. 15] seem willing to extend this 
terra to include almost anyone; it is not unlikely that by 
the 1990’s this may well be justified. However, at present 
a practical need is seen to limit this term to users such as 
managers, lawyers, analysts, accountants, and planners. 
These people need the information in the data base to help 
them make decisions but prefer not to encounter the expense 
or experience the delay in going through a third party, 
perhaps an application programmer, to process their queries. 



Not only is there a varied group who use, or would like 
to interact with a data base, but the distribution of 
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interactions by the five classes is changing. In the early 
seventies, most data base interactions were by application 
programmers. In the next two decades the number of 
interactions by this group is expected to be less 
significant; a corr esponding increase in the significance of 
interactions by on-line job-trained users seems imminent. 
3ut, by far, the casual user class seems to be the emerging 
dominant force. 

Why is the role of the casual user on the rise? An 
increasingly large number of people have recognized the 
value of the data base and the lest opportunity costs that 
result from not being able to interface with it at a level 
in which the machine properly compliments man’s decision 
maxing. In many instances availability of the proper query 
language could eliminate the need to go through the 
application programmer to answer a query, which seems 
particularly desirable in an era of increasing manpower 
costs. In many data base systems, such as military command 
and control systems, having the ability to answer queries as 
close as possible to the level at which critical decisions 
are made can be of particular importance. Some commercial 
vendors view developing the casual user market as a matter 
of survival [Ref. 16], 

Note that the need to cater to those at the casual end 
of the user spectrum in no way implies that the languages 
that are used comfortably by the more computer oriented 
users do not and will not continue to play an important role 
in data base system interface. 



B. THE NEED FOR MEASURES 



What are the differences which exist among the classes 
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of user that affect the type of query language with which 
they would be comfortable? There are certainly differences 
in the understanding of how a computer works, internal data 
representations, operating systems, and programming 
experience. The latter is of particular importance since 
non-programmers may not be acquainted with such notions as 
data structures, looping, branching, and program efficiency. 
There are quite likely differing levels of mathematical 
sophistication. Differences in the number of times a user 
interacts with a data base and the interval between 
interactions are significant as an indication of the amount 
of time and effort a user can devote to learning a query 
language. The infrequent user may experience difficulty 
retaining syntactic details of some query languages. 

In short, these differences point to the need for query 
languages compatible with users of varying qualifications 
and varying needs. This may imply having several query 
languages running concurrently on a data base. The more 
casual users do not desire to know and should not be 
required to learn the structure of the underlying data 
model, access methods, or programming control structures. 
Query languages giving users freedom from these details will 
require the machine to play a greater role in the 
man-machine symbiosis. 

C. MEASURES OF QUERY LANGUAGES 

With the need well established, qualitative and 
quantitative measures of query languages are presented. 
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1 . Quantit ati ve Measures 

Level and completeness are considered quantitative 
measures. It is not essential that an evaluator apply these 
measures directly to a language being evaluated; it is, 
however, recommended that the concepts embodied in them be 
at least subjectively applied. References 17 and 18 discuss 
other measures of '’software physics". 



a. Level 



Level [Ref. 17] is a quantitative measure of 
the amount of decision making that goes into the formation 
of a program to solve a problem. The number of decisions 
required to solve a given problem can be profoundly affected 
by the language being used. The user may be forced to make 
many decisions concerning syntax, delimiters, operators, 
etc., which are of little or no significance to the problem 
itself. Specifically, level is a mathematically derived 
value, between zero and one, based on the number of 
operators and operands used in the most eff ici e nt sol ution 
of a problem in a language. 

As an example (though perhaps extreme and 
certainly a misuse of COBOL) of the difference in level, 
consider the following problem: program in COBOL and in 
APL, the matrix multiplication of matrices A and B. The 
result in APL is A+. xB. The amount of code generated to 
solve the same problem in COBOL is obviously much greater. 
In this example APL would have a level close to one while 
COBOL would have a level close to zero. 

An evaluator may want to generate some benchmark 






algorithm to actually determine the level of languages under 
consideration. Those languages that consistently have low 
levels will genera lly prove to be procedural in nature, have 
lower query design efficiency, hold the user responsible for 
exception or error checking, depend on the user for insuring 
efficiency of execution; and are thus less suited to the 
casual user. The opposite characteristics are ge n erally 
true for languages yielding high values for level. 

One word of caution is in order. Languages that 
yield a high value for level do not necessarily relieve the 
user of the responsibility for execution efficiency. APL is 
a case in point; the order of operation may have a profound 
effect on execution efficiency. 



b. Completeness 



Completeness refers to the selection capability 
of a query language, independent of any host language in 
which it may be embedded. A complete query language allows 
an authorized user to extract any data item that is 
semantically contained within a data base. A-crually 
completeness is not a quantitative measure, but is included 
here because of its theoretical basis in mathematics. Codd 
[Ref. 19] established the basis for completeness of 
relational algebra and relational calculus. Thus the 
completeness for any language based on relational algebra or 
relational calculus may be established by determining if it 
permits the expression of any query expressible in the 
relational calculus. Recent work [Ref. 20] suggesting the 
equivalence of the three data models makes it appear 
reasonable to attempt to determine completeness for 
languages designed for hierarchical and network models. At 
the very least, an intuitive judgment of completeness should 
be obtained. 
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2 . Quali ta tive Meas ures 



Mathematical sophistication, learnabili ty, and 
procedurality are proposed and discussed as qualitative 
measures cf a query language. 

a. Mathematical Sophistication 



Mathematical sophistication is a subjective 
measure of the degree to which a language requires a user to 
be familiar with mathematical concepts, terminology, and 
symbology . 



There is much to be said for languages that have 
a strong theoretical foundation in mathematics being used as 
target languages for user-oriented source languages. 
Relational algebra, discussed later, is often used for this 
purpose. The degree to which the relational algebra 
operators (projection, restriction, etc.), used to 
manipulate data, are visible in the source language is an 
indication of the amount of mathematical background required 
by the user. While terms such as restriction and projection 
are not well known, the actions accomplished by them are 
certainly more natural than that of an algorithm which uses 
a sequence of operations acting on one element at a time to 
select data. However, when these operations are "visible" 
at the user level, the user must mentally go through the 
mathematical operations necessary to extract the data. Is 
it reasonable for a casual user, such as a lawyer, to go 
through a mathematical process to formulate a query? Should 
he really have to do more than describe what he wants? 



The presence 



or 



mathematical 



terms such as 



"range" and symbols such as C • , A -V may be other 

indicators of the amount of mathematical sophistication 
required to use a language. 

Languages requiring little or no mathematical 
sophistication will tend to be non-proced ural. 

b. Learnability 



In looking at the learnability of a language, 
one should be interested in the time and effort required to 
learn a workable subset to answer simple queries; this 
working set should meet the needs of many users. The 
restrictions and exceptions that must be mastered to compose 
moderately complex queries may be a better indication of 
learnability. The ability of casual users to retain what 
they have learned between infrequent uses must also be 
considered. 



Human factors studies even more rigorous than 
those conducted on SQUARE and SEQUEL [Ref. 21] and Query by 
Example [Ref. 22] may be necessary to accurately identify 
elements that affect learnability. These studies support 
the notions that simple concepts should be used and that 
those concepts which differ semantically should also differ 
syntactically to avoid causing confusion. Reference 21 
found that two different uses of the term WHERE were 
confusing; this ambiguity was eliminated in a later version 
of the language [Ref. 9]. Similarly Ref. 22 found that 
subjects made mistakes on one-fourth of the occasions when 
they needed to choose between the COUNT, SUM, COMPACT-COUNT, 
and AVERAGE operators. It was also noted that a significant 
percentage cf users had trouble with the universal 
quantification constructs - for all, and there exists. A 
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smaller number even had difficulty with the relational 
operators <, >, <, >. 

languages that require less mathematical 
background and are non- procedural tend to be more easily 
learned by a broad class of users. Ideally a human factors 
study should be conducted on the specific group for which a 
language is intended. 

The ability of a user to recall what he has 
learned can *be enhanced if the user is given an explicit 
format with which to formulate his query. For example, in 
SEQUEL [Ref. 16] queries are structured as: 

SELECT <whax> 

FROM <relation name> 

WHERE <condition>. 

The user is less likely to recall the operators necessary to 
construct a query in languages containing more freedom of 
form. It is possible for an implementation, particularly an 
interactive one, to assist the user by providing prompts 
(such as query formats on a CRT screen) or by providing 
menus. Interactive input devices, such as lightpens, 
tablets, and cursors, that allow the user to specify his 
requirements in the most natural way may help appreciably. 

c. Procedurality 

Several of the measures described above have 
made reference to the procedurality or non- procedural ity of 
a language. Many of these measures have a close correlation 
to the degree of procedurality of a language. Languages 
fall along a procedural — non- procedural spectrum just as 
users span a broad spectrum of backgrounds and needs. This 
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spectrum cannot be described in absolute terms. Languages 
fall along this spectrum based on the following factors, 
several of which may apply in varying degrees. 

(1) User Specified Acce ss Path 

Languages that require the user to provide 
a detailed access path to locate the desired data are highly 
procedural. In such languages the user must have a thorough 
knowledge of the underlying logical data organization. In 
hierarchical models the user must literally start at the 
root node and walk down the tree, node-by-node, until he 
reaches the required record type. In network models there 
may be several possible access paths (network models allow 
records to have more than one parent) . Since there is no 
guarantee the user will pick the optimum access path, the 
efficiency of the query's execution may be adversely 
affected. Languages with this characteristic will have a 
low level and generally require that queries be composed by 
application programmers. The efficiency of problem solution 
is, therefore, low. Additionally, the DBA has less freedom 
in restructuring the underlying schema; doing so might 
require the rewriting of user programs that run on a 
recurring basis. In relational models different relations 
do not have parent-member relationships. Therefore, 
specifying the logical access path is not characteristic of 
relational query languages. 

(2) User Conduct Elemen t - by - el ement Se ar ch 

Languages that require the user to program 
detailed handling of each record or tuple characteristically 
have go to, branching, and looping control structures. Once 
the user arrives at a node (in a hierarchical model) he must 
check each record element-by-element to determine if it 
meets the search criteria. The user is also responsible for 
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handling exceptions and error conditions. Naturally, 
languages requiring element-by-element search are quite 
procedural, have a low level, and are meant for users with 
programming experience. 

(3) User Sp ec ified Loqica 1 Order of Operations 

In languages exhibiting the preceeding two 
characteristics the user specifies the sequence in which 
operators are executed. However, this trait is identified 
separately because there are languages in which the user is 
not required to specify the access path or process records 
(tuples) individually but which do require him to logically 
apply operators in a specified sequence to locate data. 
Languages using terse relational algebra notation have this 
requirement. The term logical was used because the system 
may optimize the actual order of execution; if not, the user 
affects efficiency. Languages falling on this part of the 
spectrum are not limited to application programmers, but 
generally do require a significant degree of mathematical 
sophistication . 

(4) User Affects Ord er of E xe c uti on 

Language implementations exist in which the 
user does not actually specify the order of operations, but 
where he can significantly affect it. This feature will be 
discussed further in the examples presented in Chapter IV. 

(5) User Knowledge of Data Base Con tent 



At the extreme procedural end of the 
spectrum the user must have detailed knowledge of access 
paths. Toward the center of the spectrum relational 
languages still require knowledge of what attributes are in 
what relations; this knowledge must be used to "navigate 
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across relational boundaries" [Ref. 12]. Network and 
hierarchical languages in the center require knowledge of 
what data item is in specific files. At the non- procedural 
end the user is required only to know attribute or data item 
names. some language implementations accept predefined 
synonyms or derivatives of actual data names. Such 
implementations' are said to be "user-friendly." 



Many attempts have been made to give adjectives 
to languages along the spectrum. Very procedural languages 
in which the user must have detailed knowledge of the data 
base in order xo specify an element-by-elemenx path through 
the data base are called "navigational" languages [Ref. 23]. 
Languages along the center of the spectrum , generally 
exhibiting factors xhree and four, are "prescriptive" 
languages. Languages at the non- procedural end, simply 
requiring the user to state attribute names, are 
"descriptive". Languages requiring no knowledge of the data 
base are termed "open-ended". The special class of query 
languages which allow the user to specify queries using 
tables, forms, and geometric images are also described using 
these adjectives. 



In contrast to the procedural languages, the 
non-procedural languages require little knowledge of the 
data base or its underlying model; require less source code 
(higher level); allow the system to determine access paths; 
allow information to be addressed by content (as opposed to 
location); require less mathematical sophistication; free 
the user from error handling and execution efficiency 
considerations; are more efficient in total problem 
solution; and are easier to learn and retain. They are 
generally, though not necessarily, interactive. They also 
require another layer of software and thus may require more 
machine time. The extra layer of software gives the DBA 
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more freedom in changing the logical structure of the data 
base. 

It is important to note that every 
ncn-procedural query language must be supported by an 
underlying procedural language. And it is reasonable to use 
the criteria presented in Chapter One for selecting general 
purpose programming languages as a starting point for 
evaluating a non-procedural query language. 
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IV 



THE ^USRY LMGUAGE spectrum 



basis 



of 



their ability 



anguages is presented in 


order to 


discussed in Chapter 


III. 


No 


imit the sampling to com 


mer daily 
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on 


the 


lity to demonstrate 


various 


two examples do not 


conform 



syntactically to any actual languages, but are used to 
represent a large group of navigational languages. 

The sample query used throughout the examples is applied 
to the data base in Figures 1 and 2. The sample data base 
was extracted from Ref. 10. The sample query, 

Q1 : LIST THE ELECTION YEARS IN WHICH A REPUBLICAN 

FROM CALIFORNIA WAS ELECTED. 

is solved in all examples. 



A. PROCEDURAL - NETWORK 



The following example uses Figure 2 and is 
representative of languages such as IMS [Ref. 24], 
COD AS YL/DBTG [Ref. 4], and IDMS [Ref. 25]. This language 
models a network language, but a hierarchical procedural 
language would employ similar constructs. A typical query 
for Q 1 might be f ormulat ed as follows: 
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ELECT 10 NS -WON 



YEAR 


WINNER-NAME 


WINNER-VOTES 








1952 


Eisenhower 


442 


1956 


Eisenhower 


447 


1960 


Kennedy 


303 


1964 


Johnson 


4 86 


1968 


Nixon 


301 


1972 


Nixon 


520 



PRESIDENTS 



NAME 


PARTY 


HOME-STATE 








Eisenhower 


Republican 


Texas 


Kennedy 


Democrat 


Mass 


Johnson 


Democrat 


Texas 


Nixon 


Republican 


Calif 



Figure 1 - SAMPLE DATA BASE - RELATIONAL 
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# 



PRESIDENTS • 




# 



Figure 2 - SAMPLE DATA BASE - NETWORK 
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Q 1 ; MOVE •'Republican'* TO PARTY AND "Calif" TO 

HOME-STATS IN PRESIDENT 
FIND PRESIDENT RECORD 
if failure go to error 
LOOP1 : FIND MEMBER OF P-EW SET 

if none go to error 
PRINT YEAR 

LOOP2 : FIND NEXT MEMBER OF P-EW SET 

if.none go to LOOP3 
PRINT YEAR 
go to LOCP2 

LOOP3: FIND NEXT PRESIDENT 

if none "done" 
go to LOOP1 

The query resembles a program in a simple general purpose 
programming language. Answering a query in this language 
amounts to writing a program that "navigates" through the 
data element-by-element. Indeed the language was designed 
to be used by programmers. It has low level and requires 
detailed knowledge of the data base. 



B. PROCEDURAL - RELATIONAL 



This example, using Figure 1, was included to 
demonstrate that low level relational languages such as XRM 
[Ref. 26] and GAMMA-0 [Ref. 27] have characteristics in 
common with network/hierarchical procedural languages. 
Those minor differences which do exist are not significant. 



Q 1 : FIND FIRST PRESIDENTS TUPLE 

WHERE PARTY = "Republican" 

AND HOME-STATE = "Calif" 



LOOP1 : 



LOOP2 : 



LOCP3 : 



if failure; return "no such president" 
save name 

FIND ELECTIONS- WO N TUPLE WHERE 
WINNER-NAME = saved name 

if failure; return "president exists that did 
not win election" 

PRINT YEAR 

FIND NEXT ELECTION-WON TUPLE WHERE 
WINNER-NAME = saved name 
if none: go to LOOP3 

go to L00P2 

FIND NEXT PRESIDENTS TUPLE WHERE PARTY = 
"Republican" AND HOMS-STATE = "Calif" 
if none "done" 
go to LOOP1 



Although there is no access path to specify, the 
programmer must still use the knowledge that the President’s 
name was common to two relations in order to navigate across 
relational boundaries and extract YEAR. 



C. RELATIONAL ALGEBRA 

These languages and the one used in the following 
example are based in set theory. They were proposed by Codd 
when he first introduced the relational model [Ref. 6] and 
later when he further defined a relational algebra and 
relational calculus [Refs. 19, 28]. Languages based on 

relational algebra include MACAIMS [Ref. 29], IS/1 
[Ref. 30], and RDMS [Ref. 31]. The query Q1, using Figure 



1 , is presented in the relational algebra followed by a 
simple description of the operators used. A more complete 
description of relational operators is provided in Ref. 10. 

*(0) (ft[ (1 = 3) J^ELSCTIONS-WON* (# (0) (%[ ( 1="Republican" 

& 2="Calif") PRESIDENTS) ) ) ) 

Processing begins at the innermost nested level. 
Domains are numbered left to right beginning with zero. 

RESTRICTION (ft) of a relation amounts to selecting all tuples 
(rows) that meet the conditions defined by the restriction 
of the relation specified. Thus 

%[ ( 1="Republican" S 2 = "Calif") ] PRESIDENTS 

selects all tuples from the relation PRESIDENTS in which 
the attribute (column) number "1" has the value "Republican" 
and attribute number "2" has the value "Calif". The result 
is a new relation. A, which serves as the operand for the 
next operator. A has the value: 



NAME 


PARTY 


HOME-STATE 








Nixon 


Republican 


Calif 



PROJECTION ( #) extracts specified attributes. The result is 
a new relation. Thus # (0) A selects the name attribute from 
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A yielding B 



NAME 



Nixon 



PRODUCT (*) concatenates its left and right arguments. If N 
and M are the cardinalities of the relation, the result is a 
new relation having N x M tuples. ELECTIONS-WON * 5 yields 
C: 



YEAR 


WINNER-NAME 


WINNER- VOTES 


NAME 










1952 


Eisenhower 


442 


Nixon 


1956 


Eisenhower 


447 


Nixon 


1960 


Kennedy 


303 


Nixon 


1964 


J ohnson 


486 


Nixon 


1968 


Nixon 


301 


Nixon 


1972 


Nixon 


520 


Nixon 



To this result a restriction is applied in which WINNER-NAME 
= NAME to eliminate ambiguous information such as the tuple 



37 



"1952, Eisenhower, 442, Nixon". 



The result is: 



YEAS 


WINNER-NAME 


WINNER- VOTES 


NAME 










1968 

1972 


Nixon 

Nixon 


301 

520 


Nixon 

Nixon 



A projection yields the final answer: 




Actual iiplementations might use a more descriptive 
syntax in order to improve readability and learnabilit y. 

The significant difference between this and the two 
preceeding examples is that the user specified what he 
wanted in terms of sets not in terms of individual records. 
The underlying system took care of locating the relations, 
doing element-by-element processing, including error 
handling and storing intermediate results (perhaps 
virtually). The language is not navigational; it is 
essentially prescriptive. However, the user still had to 
specify the operations to be performed and the order in 
which they were to be performed. Some implementations may 
optimize the query by (1) finding an equivalent re-ordering 



of the operators, (2) performing some operations virtually, 
(3) using a different combination of operators, (4) using a 
different set of relations and attributes, or (5) a 
combination cf the above. Obviously, in many cases the user 
determines or affects the execution efficiency of his query. 



The relational algebra does not require the programming 
skills of the first two examples, but the mathematically 
unsophisticated user might find it cumbersome, at least in 
its terse form. 



D. RELATIONAL CALCULUS 



In t 
wants u 
into an 
Relation 
followed 
calculus 



he relatio 


nal 


calculus 


the user 


specifies 


what he 


sing logi 


cal 


operators. The 


system maps 


the query 


equivalent 


expression 


in the 


relational 


algebra. 


al calcul 


us 


may b e 


though t 


of as restrictions 



by projections. Q1 stated in the relational 
uses Figure 1: 



{x: 3 ( (y, Republican, Calif) 6 Presidents /\ (x,y , z) £ 
ELECT 10 NS -WON) } 



The query specifies what is wanted, x; where it comes 



from, ELECTIONS- HON ; 


and 


with 


what 


qualif 


icat ions . 


Paraphrasing, Select 


x from 


(x,y. 


z) , an 


eleme 


nt of (£) 


ELECTIONS-HON, where z 


has any 


value 


and (A) 


y is t 


aken from 


(y , Republican,Calif ) , 


an element of 


President. X, 


y, and z 



are variable; Republican and Calif are constants. 

The relational calculus is prescriptive. The user 
relies on his knowledge of what attributes are in what 
relations tc prescribe a solution to his query. It is the 
system's responsibility to determine the exact sequence of 
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operations to be used. As with the relational algebra, it 
is still possible that using some decomposition algorithms 
the user may have a profound influence on execution 
efficiency. This is particularly true in implementations 
that use join as one of the operators in the underlying 
algebra (Pecherer [Ref. 32] has shown that certain subsets 
of the relational algebra are sufficient) . 

As discussed earlier many users have difficulty with the 
universal and existential qualifiers as well as set 
notation. This could limit the number of users who would 
feel comfortable with this language. 



E. QUEL 



QUEL [Ref. 33] is typical of languages which are based 
on the relational calculus. Other such languages include 
ALPHA [Ref. 28], COLARD [Ref. 34], and RIL [Ref. 35]. 

These languages do not require the user to apply quantifiers 
directly . 

QUEL is actually a query language as well as a DSL which 
is embedded in the "C" language and the UNIX operating 
system using the procedural language EQUEL [Ref. 33]. 

The solution to Q1, using Figure 1, is presented: 

RANGE OF E IS EL ECTIO NS-WON 
RANGE OF P IS PRESIDENTS 
RETRIEVE E. YEAR 

WHERE E. WINNER-NAME = P. NAME 

WHERE P. PARTY = "Republican" AND 
P. HOME-STATE = "Calif" 



The RANGE statements specify the relations used in the 
RETRIEVE and WHERE statements. The requirement for the user 
to have detailed knowledge of relational structure is 
reflected in the block structure of the language. Here 
again the user is prescribing a means to navigate across 
relational boundaries. 

F. CUPID 



References 36 and 37 describe CUPID (Casual User 
Pictorial Interface Design) , which as its name implies is a 
picture-oriented query language. It is intended for the 
"casual" user. CUPID contains a high-level, menu-type 
sublanguage which is the front-end to INGRES, the relational 
data base system supporting QUEL. Additionally, CUPID 
offers a user definition facility which allows the system to 
"learn" new concepts. CUPID is presented here due to its 
graphic nature, which places it in the "special" category of 
query languages. 

Figure 3 illustrates how Q1 might appear on a cathode 
ray tube device. An English language approximation of the 
query as it is depicted here, would be, "Select and save 
Name from Presidents relation where Party equals Republican 
and Home-state equals Calif; select and output Year from 
Elect ions-won relation where Winner-name equals saved Marne." 
The graphic diagram is drawn as a result of user inputs from 
a keyboard. 



In order for the individual graphic symbols to be 
displayed, the user must select from a menu of shapes 
available, those symbols which are necessary to formulate 
his query, and then through interactive queuing, properly 
position each symbol. It would seem that this would, for 



all practical purposes, necessitate the user having sketched 
his intended query, or at least have it well-formed in his 
own mind prior to commencing "construction". There are 
unique shapes available in the menu to represent each of the 
following entities: relation names, domain names, 
relational operators, arithmetic operators, logical 
operators, constants, q-box es (designa tes the target list of 
requested data) , and a special symbol to enclose aggregate 
operations. 




Figure 3 - CUPID QUERY 

this query appears fairly straightforward, some 
comments are in order. CUPID is prescriptive 
user must provide a graphical "prescription" for 



While 

additional 
since the 




Figure 4 - AMBIGUOUS QUERY IN CUPID 

answering a query. It would indeed be difficult to assess 
the learnability of this language without a much more 
thorough examination. Although, one small study reoorted 
favorable results [Ref. 37]. It can, however, be observed 
that the menu-selection feature would most surely stimulate 
the infreguent user's recall. The language requires a 
fairly high-level of mathematical sophistication; even the 
most simple queries generally cannot be formulated without 
boolean operators. Furthermore, it is not clear (at least 
to the casual user) whether the query depicted in Figure 4 
would yield the same results as that in Figure 3; a 
substantial understanding of the formal syntax is also 
required. Considering these factors, it would seem that the 



language would be mors appropriate for the on-line 
job-trained user or perhaps the research category of user. 
As for the level and completeness of CUPID, it would be 
expected that both are very near QUEL, as the pictorial 
queries of CUPID are compiled into that prescriptive query 
language . 

An additional, unique feature of CUPID is its definition 
capability. There are provisions for both user definitions 
and "learning" (through global definition tables) . As an 
example, Figure 5 depicts how a user might formulate the 
query Q2 : "List the names of minority presidents." The 
vocabulary definition algorithms would ideally resolve both 
problems presented by this query: (1) define 
"minority-president" and (2) resolve the apparent 
misplacement of "m inority- president" , an unknown value for 
the winner-name domain. The user would eventually be 
required, through a queuing sequence, -o provide the maximum 
number of votes which would qualify a member of the 
winner-name domain as a minority president. 



Q 2: 




Figure 5 - USER-FRIENDLY QUERY IN CUPID 



With some practical experience, it appears that CUPID 
would be an interesting, tidy method of expressing simple 
queries, and almost fun to use. On the other hand, more 
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complicated queries, such as the examples depicted in Ref. 
36, become quite cumbersome and difficult to fellow. 
Nonetheless, its designer should be commended for her un ique 
efforts in the area cf pictorial (graphical) queries, her 
success with user vocabulary definition, and the 
contribution to reduction of typographic and spelling errors 
in English-language text input by use of the "menu" 

facility. 

G. SEQUEL 

SEQUEL [Ref. 24], an outgrowth of SQ 
typical of what Chamberlin [ 

"mapping-oriented languages". Mapping- 
specify queries by defining a mapping be 
result, which is a relation, and relati 
to exist in the data base. SEQUEL was o 
for interactive problem solving by non-co 
System R [Ref. 31] implements SEQUEL bot 
language and as a DSL callable from appl 
languages. The SEQUEL syntax resembles t 
are block structured and use WHERE stat 
solution of Q 1 using Figure 1 is: 

SELECT YEAR 

FROM ELECT 10 NS -WON 

WHERE WINNER-NAME = 

SELECT NAME 

FROM PRESIDENTS 

WHERE PARTY = "Republican" 

AND HOME-STATE = "Calif" 

Most of the comments about QUEL are applicable here. 
SEQUEL is prescriptive. The syntax of SEQUEL (SELECT, FROM 
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vice RANGE, RETRIEVE) seems more natural and more learnable 
than that of QUEL (see Ref. 21 for human factors study of 
SEQUEL) . The mathematical sophistication required for 
simple queries is quite low. More complex queries may 
require use of the set operators - union, intersection, and 
set difference between intermediate mappings. For example, 
use the relations in Figure 1 to answer Q3: 

Q3 : FIND THE NAMES OF PRESIDENTS BORN IN TEXAS WHO RECEIVED 

MORE THAN 400 W INNER- VOTES . 

An appropriate query might be: 

SELECT WINNER-NAME 

FROM ELECTIONS- WON 

WHERE WINNER-VOTE > 400 

n 

SELECT NAME 

FROM PRESIDENTS 

WHERE HOME-STATE = "Texas" 

Some users may not have the prerequisite background for 
the proper use of the set operators. 

H. QUERY RY EXAMPLE 

Query by Example [Refs. 39, 40] is intended to serve the 
needs of the non- programming casual user with little 
mathematical background. It is presented here as an example 
of a category of query languages generally referred to as 
"forms". MARK IV [Ref. 41], intended for batch usage, was 
an early attempt at the forms approach. Another significant 
example of this approach has been presented by the C0DASYL 
End User Facility Task Group [Ref. 42], which is attempting 
to define a "language" to emulate the naturalness of 



manually extracting data from "forms" which are familiar to 
the user. 

In Query by Example the user formulates his query by 
displaying blank tables on the CRT, naming the tables and 
their columns, and filling in the columns to illustrate the 
query to be answered. The queries are then translated into 
relational calculus for processing. 

The Query by Example solution to Q1 follows: 



ELECT IONS- WON 


YEAR 


WINNER-NAME 


WINNER- VOTES 












P . 1 9 48 


Wilson 





PRESIDENTS 


NAME 


PARTY 


HOME-STATE 












Wilson 


Republican 


Calif 



Republican and Calif are "constant elements" and are not 
underlined. WILSON and 1948 are variables, termed 
"example elements", and are designated as such by 
underlining. Example elements need not be actual elements 
in the data base. "P." specifies that the element is to be 
retrieved and printed. 
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This language is classified as "prescriptive" because 
the user must prescribe the means to navigate across tvo 
"tables" in a manner not unlike that required by SEQUEL. 
Reference 22 presents a human factors study of Query by 
Example. As discussed earlier, universal quantification 
presented a problem for some users. Q4 , taken from 
[Ref. 39], is a query requiring universal quantification. 

Q4 : FIND THE NAMES OF SUPPLIERS WHO SUPPLY A JOB LOCATED 
IN NEW YORK WITH ALL PARTS OF TYPE A. 




PARI 


PART-NAME 


TYPE 










ALL Rod 


A 



JOB 


JOB- NAME 


LOCATION 










Bulb 


New York 









ALL Rod means all PART-NAMSs of TYPE A. The dot '•« 
indicates that a SUPPLIER may supply more than parts of 
TYPE A to a job in New York. 

In comparing this query with that for relational 
calculus, note how the target language shows through. The 



problem with universal quantification was also detected in 
the evaluation of SEQUEL [Ref. 21]. This suggests that the 
level of mathematical sophistication required for a language 
may be easily under-estimated, and that the casual user's 
facility of the language may be limited. 



X. APPLE 

APPLE [Ref. 12] is the language used by a developing 
system which allows the user to specify queries using only 
attribute names. The formulation of Q1 follows: 

SELECT YEAR W HERE PARTY = "Republican" AND 
H CHE-STATE = "Calif" 



This language is truly "descriptive". The user need not 
"navigate" or give a "prescription" for navigating across 
relational boundaries; he need not even know what boundaries 
exist, let alone know what attributes transcend their 
boundaries. All that is required is a knowledge of 
attribute names. The system determines access paths and 
identifies the operators necessary to answer the query. 
APPLE currently has some implementation problems relating to 
the solution of ambiguities within queries, but, when 
satisfactorily resolved, could present the casual user with 
a language that is truly simple to understand and use. 



J. NATURAL LANGUAGE 

The idea of using natural English as a query language is 
not new or unique [Ref. 43]. In fact, Simmons [Ref. 44] 
reviewed fifteen experimental question-answering systems 
more than ten years ago. 
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Instead cf limiting the system to a subset of the 
English language, another approach which is relativel y easy 
to implement is the computer-initiated dialogue. The user 
may experience the sensation of communicating with the 
machine in a very natural manner, when in reality the 
machine forces the use of a restricted subset of the 
language by initiating the dialogue and requiring the user 
to respond in an unambiguous manner which can be clearly 
understood by the machine. 



To 

remains 



implement a genuine, user initiated dialogue system 
a challenge for the following reasons: 



^ . 



and 



Spelling Errors 



Free-form input 
problems with respect 



opens a Pandora's box of potential 
to typographic and spelling errors. 



Some query languages, such as CUPID 
significance of this problem by the 
menu. 



2. Synta ctic and Lexical Am bigu it ie s 



Many grammatically correct Englis 
are inherently ambiguous. For instan 
receiving diplomas by sections,' 1 can easi 
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applied to modify (1) the verb "list' 1 o 
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There can be no doubt that natural English query 
languages will eventually come into widespread usage. In 
the interim, however, it is not unreasonable to expect the 
casual, but interested, user to familiarize himself with the 
environment of the DBMS and one of the currently available 
prescriptive query languages. Surely a cleric who has used 
tape-output adding machines for his entire career would be 
quite bewildered the first time he attempted to use a modern 
hand-held, re verse- pol ish notation calculator. But once 
acquainted with the machine, and its increased computational 
power, it would probably be difficult to persuade him to 
return to his former environment. 
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It is evident that the user must be a 
consideration in selection of a query language. The 
casual the user the mere non- procedural the languag 
be. "However, there will, no doubt, always be users 
interaction rates are so high, whose types of intera 
are limited and whose data structures change slowly 
that they will rationally prefer a procedural s 
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[Ref. 20], The final choice of language (s) must be male 
within the context of other realities. One may be limited 
to existing hardware in which case increased machine time 
may not be available or the desired software interface may 
not exist. 

Currently, the choice of commercial languages is limited 
to those which are designed for network and hierarchical 
models; it may be some time before a relational model is 
commercially available. The importance of the underlying 
model is likely to diminish in light of Stonebraker and 
Held's suggestion that the hierarchical and relational 
models are special cases of the network model [Ref. 20], 
and Date’s proposed architecture for a single high-level 
language which supports all three data models [Ref. 46]. 

Reference 20 suggests another factor that should not be 
overlooked — non- procedural languages are inappropriate for 
certain queries. For example, consider the query: 

Q5: FIND the President receiving the second highest 
HINNER-VOTZS. 

It is unlikely that a non-procedural language processor 
would handle Q5 efficiently. 

All things considered, most general data base 
implementations will require a mix of query languages to 
meet the needs of its various users, while a single query 
language may suffice for some special purpose data bases. 



B. THE FUTURE 

Two factors will impact on the DBMS of the future. 
First, the machine's role in the man-machine symbiosis will 



expand beyond providing data which man analyzes and uses to 
make decisions. The machine itself will be programmed to 
access the data base and conduct increasingly higher levels 
of analysis to assist man in making complex decisions. A 
thorough discussion of "decision support systems" is found 
in Ref. 47 and an experimental system, GMIS, is the subject 
of Ref. 48. 



Secondly, the arrival of mass storage devices on the 
market and the expected arrival of reasonably priced 
associative memory hardware will undoubtedly have a profound 
impact on the future of DBMS's and query languages. Mass 
storage devices make it practical to store large volumes of 
information on-line. Associative memory will surely serve 
as a catalyst for the development of the Data Base Machine 
(DBM) . 
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